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Abstract —Due to its causal semantics, Bayesian networks (BN) have been widely employed to discover the underlying data 
relationship in exploratory studies, such as brain research. Despite its success in modeling the probability distribution of variables, 
BN is naturally a generative model, which is not necessarily discriminative. This may cause the ignorance of subtle but critical 
network changes that are of investigation values across populations. In this paper, we propose to improve the discriminative power 
of BN models for continuous variables from two different perspectives. This brings two general discriminative learning frameworks for 
Gaussian Bayesian networks (GBN). In the first framework, we employ Fisher kernel to bridge the generative models of GBN and the 
discriminative classifiers of SVMs, and convert the GBN parameter learning to Fisher kernel learning via minimizing a generalization 
error bound of SVMs. In the second framework, we employ the max-margin criterion and build it directly upon GBN models to explicitly 
optimize the classification performance of the GBNs. The advantages and disadvantages of the two frameworks are discussed and 
experimentally compared. Both of them demonstrate strong power in learning discriminative parameters of GBNs for neuroimaging 
based brain network analysis, as well as maintaining reasonable representation capacity. The contributions of this paper also include 
a new Directed Acyclic Graph (DAG) constraint with theoretical guarantee to ensure the graph validity of GBN. 

Index Terms —Bayesian network, discriminative learning, Fisher kernel learning, max-margin, brain network. 
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1 Introduction 

S an important probabilistic graphical model, 
Bayesian network (BN) has been used to model 
the probability distribution of a set of random variables 
for a wide spectrum of applications, e.g., diagnosis, 
troubleshooting, web mining, meteorology and bioinfor¬ 
matics. It combines graph representation with Bayesian 
analysis, providing an effective way to model and infer 
the conditional dependency of the variables. A BN has to 
be a directed acyclic graph (DAG). Two factors character¬ 
ize a BN, i.e., the structure of the network (the presence / 
absence of edges in the graph) and the parameters of the 
probability distribution. Recent research of BN focuses 
on how to learn the structure and the parameters of BN 
directly from the data. 

The approaches of learning BN structures can be 
roughly categorized into the constraint-based, the score- 
based, and the hybrid approaches. The constraint-based 
approaches use a serie of conditional independence test¬ 
ing to ensure the model structure is consistent with the 
conditional independency entailed by the observations. 
Methods in this class include the IC algorithm (lj, PC 
algorithm 0, and more recent methods 0, (4J. Score- 
based approaches define a scoring function over the 
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space of candidate DAGs and optimize this function 
through certain search strategies. Methods in this class 
vary with scoring criteria, e.g., the posterior probabil¬ 
ity CD/ 0/ m and the minimum description length 0, or 
vary with search strategies, e.g., the heuristic search 0 
and the Monte Carlo methods j5j. Hybrid approaches 
usually employ constraint-based methods to prune the 
search space of DAG structures and consequently re¬ 
strict a subsequent score-based search flOl . fTD . Many 
existing BN learning methods, such as LIMB-DAG |12| , 
MMHC IflOl . TC and TC-bw JT3), comprise of two stages: 
the identification of candidate parent sets in the first 
stage and the further pruning of them based on certain 
criteria in the second stage. Despite the mitigation of 
computational complexity, a drawback arises that if a 
parent node is missed in the first stage, it will never 
be recovered in the second stage fT4| . To address this 
issue, one-stage learning process has been preferred in 
recent research work fl4l . [15|. In these studies, based 
on Gaussian Bayesian network (GBN), the parent sets of 
all variables are learned together to optimize a LASSO- 
based score function in a single stage. The related opti¬ 
mization problems are solved either approximately |14| 
or exactly [151 . They have demonstrated improved re¬ 
liability of BN edge identification over traditional two 
stage methods. 

Although BN is naturally a generative method, it has 
also been used in classification tasks for diagnostic or 
predictive purposes. A straightforward usage is to train 
each class a BN and classify a new sample into the class 
with the highest likelihood value lilil . Another kind of 
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approaches trains "Bayesian network classifiers" with 
discriminative objective functions |[T6| , [17), ff8| . In these 
approaches, usually a single BN is learned to optimize 
the discrimination performance. Either the structure or 
the parameters of the BN are adjusted to reflect the class 
difference for better classification. Therefore, the result¬ 
ing BN does not model the distribution of any individual 
class. The "Bayesian network classifiers" in ||T6l , (T7|, 
Ifl8l are designed for discrete variables of multinomial 
distribution. They still inherit the two-stage learning 
process, i.e., have to predefine candidate parent sets as 
mentioned above. 


Learning BN from the data faces new challenges in 
exploratory domains, such as brain research, where the 
mechanism of brain and mental diseases remain un¬ 
clear and need to be explored. These domains usually 
cater for both interpretation and discrimination. "In¬ 
terpretation" requires interpretable models of the data 
and the findings explained by domain language rather 
than mathematical terms. This requirement comes from 
the demand of understanding the domain problems. 
"Discrimination" requires the models to have sufficient 
discriminative power to distinguish groups of interest 
(such as identifying the diseased from the healthy), 
for the purpose of prediction. To some extent, a high 
accuracy of the predictive model also provides a measure 
of the amount of information captured by that model. 


Being a generative method, BN represents the dis¬ 
tribution of the data and is naturally amenable for 
interpretation. However, it is known that generative 
methods are not necessarily discriminative. They are 
prone to emphasizing major structures that are shared 
within each group, and neglecting the subtle but critical 
changes across groups. The latter, unfortunately, often 
happens, for example, in disease-induced brain changes 
across clinical groups. Consequently, generative methods 
are usually inferior in prediction compared with the 
discriminative methods that target only the boundary 
of classes (such as Support Vector Machines (SVMs)). 
On the other hand, discriminative methods often en¬ 
counter the difficulty of interpretation, which is critical 
in exploratory research aimed at both the understanding 
and the prediction. Thus, this paper is motivated by 
the advantages that can be gained by learning BNs 
that are both representative and discriminative. Different 
from the Bayesian network classifiers in fl6l . |jl7). fl8l 
that address discrete variables, we learn discriminative 
BNs for continuous variables, which is often needed 
in many domains including neuroimaging-based brain 
research. Moreover, we learn for each class a BN with 
enhanced discrimination and maintain the BN repre¬ 


sentation of each individual class for interpretatiorQ. 
To achieve our goal, we propose two discriminative 
learning frameworks based on sparse Gaussian Bayesian 
network (SGBN). 

In the first framework (termed KL-SGBN), we employ 
Fisher kernel Ifl9) to link the generative models of SGBN 
to the discriminative classifiers of SVMs, and convert 
the SGBN parameter learning to Fisher kernel learning 
via maximizing a generalization bound of SVMs. The 
contributions of this framework include the following, i) 
By inducing Fisher kernel on SGBN models, we provide 
a way to obtain sample-specific SGBN-induced feature 
vectors that can be used by the discriminative classifiers 
such as SVMs. Through this, we bridge the generative 
models and the discriminative classifiers, ii) We propose 
a kernel learning approach to discriminatively learn the 
parameters of SGBNs by optimizing the performance of 
SVM. iii) As a by-product, the manipulation of Fisher 
kernel on SGBN provides a new way of variable selec¬ 
tion for SGBNs. This framework has a computational 
advantage: through the mapping of Fisher kernel, the 
SGBN-induced feature vectors become linear functions 
of the SGBN parameters, which significantly simplifies 
the optimization problem in the learning process. 

Unlike KL-SGBN where the discrimination is obtained 
by optimizing the classification performance of SVMs, 
in the second learning framework (termed MM-SGBN), 
we propose to optimize a criterion directly built upon 
the classification performance of SGBNs. The motivation 
is that optimizing the performance of SVMs may not 
necessarily guarantee an equivalent improvement on 
SGBNs when SGBNs are the goal of applications. The 
contribution of this framework is a max-margin based 
method to jointly learn SGBNs, one for each class, for 
both representation and discrimination. 

In addition to the two discriminative SGBN learning 
frameworks, our contributions in this paper also include 
a new DAG constraint of SGBN based on topological 
ordering to ensure the validity of the graph. This new 
DAG constraint circumvents the awkward hard binariza- 
tion of SGBN parameters in the process of optimization 
in EH), and simplifies the related optimization problems. 
This consequently makes it possible to optimize all the 
SGBN parameters together to avoid the influence of 
feature ordering encountered in the Block Coordinate 
Descent (BCD) optimization in | fl4| . Moreover, this new 
DAG constraint also circumvents the need for presetting 
candidate parent sets as in lfT7l . 

Although the discriminative learning frameworks pro¬ 
posed in this paper are general methods, we focus on 
their applications in neuroimaging analysis for the early 

1. In this paper, we deal with the scenario that maintaining the BN 
representation of individual class is critical for the understanding of 
domain problems, such as the brain network models for the healthy 
and the diseased groups. However, it is not difficulty to see our 
discriminative learning frameworks could be slightly modified to learn 
only a single BN as the existing "Bayesian network classifiers" for 
continuous variables. However, this deviates from our motivation and 
therefore is not unfolded in this paper. 
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diagnosis of mental diseases. A newly emerging field 
in the imaging-based neuroscience, called brain network 
analysis, attempts to model the brain as a complex 
network and study the interactions of brain regions via 
imaging-based features l20l . Such research is important 
because brain network change is often found to be a 
response of the brain to damages. Due to its causal se¬ 
mantics, BN has been employed to model the "effective 
connectivity" of the brain fl4l , [2TI . l22l . The direction¬ 
ality of the connections may disclose the pathways of 
how one brain region affects another. The discoveries 
may lend further credence to the evidence of causal 
relationship changes found in many mental diseases, 
such as the Alzheimer's disease (AD) (23l , (14), [24], j22), 
and uncover novel connectivity-based biomarkers for 
disease diagnosis. The proposed learning frameworks 
has been tested on multiple neuroimaging data sets. As 
demonstrated, our methods can significantly improve 
the discriminative power of the obtained SGBNs, as well 
as maintaining their representation capacity. 

Early conference versions of this work were published 
in 125'j. j26|. In this paper, a significant extension has 
been made on the following aspects. First, we analyze 
the problems of the DAG constraint used in l25l , 1251 , 
Ml and propose a new constraint with theoretically 
guaranteed DAG property to overcome those draw¬ 
backs. Second, we experimentally verify the new DAG 
constraint on benchmark Bayesian network data sets for 
network structure learning, and compare our method 
with another eight competing methods in the litera¬ 
ture. Third, we update our two discriminative learning 
frameworks with the new DAG constraint and redo all 
the experiments in our early work [25], [26]. Fourth, 
we analyze the connections and differences between the 
two proposed discriminative learning frameworks, and 
conduct more comprehensive experiments to explore the 
characteristics of our frameworks with varied parame¬ 
ters, which has not been done in l25l , l26l . 

The rest of the paper is organized as follows. Section |2] 
reviews SGBN and introduces the background of brain 
network analysis. Sections [3] elaborates two frameworks 
to learn discriminative and representative SGBNs from 
continuous data. Section 0] revisits the problem of the 
existing DAG constraint of SGBN, and proposes a new 
one based on topological ordering. The proposed two 
learning frameworks with the new DAG constraint are 
experimentally tested in Section [5] This paper is con¬ 
cluded in Section [6] The notations of symbols frequently 
occurring in this paper are summarized in Table [l] 

2 Background 

To make this paper self-contained, we introduce the 
background for both the methodology and its appli¬ 
cation to brain network analysis. Please note that the 
methodology could be generalized to applications be¬ 
yond the example given in this paper. 


TABLE 1 

Notation 


Xi 

a random variable 

X 

a sample of m variables: x = [xi, X 2 , ■ ■ ■ , x m ] T 

X 

the data matrix of n samples, X £ R" xm 

X;, : 

the i-th row of X, representing a sample 

X;,i 

the i-th column of X, representing the realization 
of the random variable Xi on n samples 

© 

the parameters of a Gaussian Bayesian Network 
e = [0i,-■■ ,0 m ], © £ R mxm 

Pa^ 

a vector containing the parents of Xi 

PA; 

a matrix whose j-th column represents a 

realization of Pa; on the j -th sample. 

G 

an m x m matrix for BN: if there is a directed edge 
from Xi to Xj, Gjj = 1, otherwise Gij = 0 

P 

an mx m matrix for BN: if there is a directed path 
from Xi to x-j, P= 1, otherwise Py = 0 


2.1 Sparse Gaussian Bayesian Network (SGBN) 

Because this paper is based on SGBN model, in the 
following, we review the fundamentals of SGBN in fl4l . 
All the symbols are defined in Table [l] 

A Bayesian network (BN) Q is a directed acyclic graph 
(DAG), i.e. there is no closed path within the graph. It 
expresses the factorization property of a joint distribu¬ 
tion p(x) = p( x i Pa; ) ■ The conditional probability 

2=1, ••• ,ra 

p(xi\P&i) is assumed to follow a Gaussian distribution 
in Gaussian Bayesian Network (GBN). Each node Xi is 
regressed over its parent nodes Pai: Xi = OjPai+et, 
where the vector 0, is the regression coefficients, and 
£i ~ Af( 0, erf). The structure of BN could be characterized 
by the m x m matrix G or P (defined in Table Q3, 
representing the edges / paths in the graph, respectively. 

Identifying parent sets is critical for BN learning. 
Traditional methods often consist of two stages: the 
candidate parent sets are initially identified in the first 
stage and further pruned by some criteria in the second 
stage. A drawback arises that when a true parent is 
missing in the first stage, it will never be recovered in 
the second stage. The work in [14] proposed a different 
approach based on sparse GBN (SGBN), denoted as 
H-SGBN in this paper. In H-SGBN, each node Xi is 
regressed over all the other nodes, and its parent set is 
implicitly selected by the regression coefficients 0, that 
are estimated through a constrained LASSO regression. 
The following optimization is solved in fl4l : 

m 

m i n E H x ^ - PA ^ll2 + Arllflilli (2.1) 

i=1 

s.t. @ji x Pij = 0, \/i,j = 1, • • • ,m, i^j. 

A challenge for BN learning is how to enforce the DAG 
property, i.e., avoiding directed cycles in the graph. A 
sufficient and necessary condition for being a DAG is 
proposed in [14], which requires ©y; x Py = 0 for all 
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i and j. Note that P, ;; is an implicit function of 0 r , 
(i.e., P = expm(0), the matrix exponential function of 
0, as in HU). Eqn. (12.1b is difficult to solve. In (131 , 
a block coordinate descent (BCD) method is employed 
to solve a LASSO-like problem efficiently. The whole 
0 is optimized column-wisely and iteratively. In each 
iteration t, only one column of 0, say & j, is optimized 
with P fixed as ph _1 ) in the last iteration. Then 0 u> , 
with the updated column © ;J , is binarized to obtain 
GW, based on which, pW is recalculated by a Breadth- 
first search with Xi being the root node. The process 
is repeated until convergence. H-SGBN simultaneously 
obtains the structure and the parameters of an SGBN via 
learning 0, e.g., there is no edge i —> j if 0 ;J is zero. It 
has been demonstrated to outperform the conventional 
two-stage methods in network edge recovery. 

2.2 Brain Network Analysis 

Neuroimaging modalities and analysis techniques can 
provide more sensitive and consistent measurements 
than traditional cognitive assessment for the early di¬ 
agnosis of disease. Many mental disorders are found 
associated with subtle abnormalities distributed over the 
entire brain, rather than an individual brain region. The 
"distributive" nature of mental disorders suggests the 
alteration of interactions between brain regions (neu¬ 
ronal systems) and thus the necessity of studying the 
brain as a complex network. Brain networks are math¬ 
ematically represented by graphical models, which can 
be constructed from neuroimaging data as follows. The 
brain images belonging to different subjects are first 
spatially aligned to a common stereotaxic space by affine 
or deformable transformation, and then partitioned into 
regions of interest (ROI), i.e., clusters of imaging voxels, 
using either data-driven methods or predefined brain 
atlas. A brain network is then modeled by a graph 
with each node corresponding to a brain region and 
each edge corresponding to the connectivity between 
regions. Brain network analysis studies three kinds of 
brain connectivity. In this paper, we focus on the "effec¬ 
tive connectivity" that describes the influence one brain 
region exert upon another. Some early works in this field 
require a prior model of brain connectivity and most 
have only considered a small number (< 10) of brain 
regions using techniques such as structural equation 
modeling 1271 and dynamic causal modeling 1281 . More 
recently, models such as BN and Granger Causality have 
also been introduced into this field. It is suggested that 
BN may have advantages over those lag-based methods 
for brain network analysis by an experimental fMRI 
study ED . Among BN-related methods, it is worth 
noting that the work in fl4| is completely data-driven, 
which recovers SGBN from more than 40 brain regions in 
fluorodeoxyglucose PET (FDG-PET) images. The method 
employs the strategy of sparsity constraint to handle rel¬ 
atively larger scale BN construction, and circumvents the 
traditional two-stage procedure for identifying parent 


sets in many sparse BN learning methods [12J, llOl . 

3 Proposed Discriminative Learning of 
Generative SGBN 

BN models are by definition generative models, focusing 
on how the data could be generated through an un¬ 
derlying process. In the context of neuroimage analysis, 
these models represent the effective brain connectivity 
of the given population. When used for classification, 
e.g., identifying AD patients from the healthy, the SGBN 
models are trained for each class separately. A new 
sample x, is then assigned to the class with the higher 
likelihood of SGBN. This may ignore some subtle but 
critical network differences that distinguish the classes. 
Therefore, we argue that the parameters of the genera¬ 
tive model should be learned from the two classes jointly 
to keep the essential discrimination. 

Integrating generative and discriminative models is an 
important research topic in machine learning. In [29], 
the related approaches are roughly divided into three 
categories: blending, staging and iterative methods. In 
blending methods, both the discriminative and the gen¬ 
erative terms are incorporated into the same objective 
function. In staging methods, the discriminative model is 
trained on features provided by the generative model. In 
iterative methods, the generative and the discriminative 
models are trained iteratively to influence each other. 
In this paper, we propose two kinds of discriminative 
learning frameworks to achieve our goal. One is a stag¬ 
ing method, called Fisher-kernel-induced discriminative 
learning (KL-SGBN). It extracts sample-based features 
from SGBN by Fisher kernel to optimize the classification 
performance of SVM. The other is a blending method, 
called max-margin-based discriminative learning (MM- 
SGBN). It directly optimizes the classification perfor¬ 
mance of SGBNs subject to maintaining SGBN's repre¬ 
sentation capacity. The two frameworks are elaborated 
in the following sections, respectively. 

3.1 Proposed Fisher-kernel-induced Discriminative 
Learning (KL-SGBN) 

We first introduce the Fisher-kernel-induced discrimina¬ 
tive learning of SGBN, i.e., KL-SGBN. The algorithm is 
illustrated in Fig. [l] and overviewed as follows. Given 
two classes in comparison, two SGBN models (with the 
parameters of ©i and © 2 ) are learned, one for each 
individual class. The original samples are then mapped 
into the gradient space of the SGBN parameters ©i 
and ©2 by Fisher kernel (Section 13.1. lb . Through this 
mapping, each sample is represented by a new feature 
vector (called Fisher vector mi) that is a function of 
0 = [© 1 , © 2 ]. These sample-specific feature vectors 
are then fed into an SVM classifier to minimize its 
generalization errors by adjusting 0 (Section 13.1.2L The 
obtained optimal 0* and ©2 encode the discriminative 
information and therefore improve the original SGBNs. 
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In this way, we convert the discriminative learning 
of SGBN parameters to the discriminative learning of 
Fisher kernels. 



Generative Fisher Discriminative 

Models Kernel Model 


Fig. 1. Illustration of Fisher-kernel-induced Discriminative 
Learning. 


3.1.1 Induction of Fisher vectors from SGBN 

Below we introduce how to use Fisher kernel on SGBNs 
to obtain feature vectors required for kernel learning. 

Fisher kernel [19J provides a way to compare samples 
induced by a generative model. It maps a sample to 
a feature vector in the gradient space of the model 
parameters. The intuition is that similar objects induce 
similar log-likelihood gradients of the model parameters. 
Fisher kernel is computed as /\(x, x') = U“ 1 g x ', 
where the Fisher vector g x = Vg log(p(x|0)) describes 
the changing direction of parameters to better fit the 
model. The Fisher information metric U weights the 
similarity measure, but is often set as an identity matrix 
in practice lfT9l . 

Fisher kernel has recently witnessed successful appli¬ 
cations in image categorization 1501 . 1i3T1 for inducing 
feature vectors from Gaussian Mixture Model (GMM) 
of a visual vocabulary. Despite its success, in the ap¬ 
plications above, Fisher kernel is mainly used as a fea¬ 
ture extractoi@. It has not been applied to learning the 
parameters of probability distributions before the early 
work of this paper in l25l . The advantage of learning 
discriminative Fisher kernel has also been confirmed by 
a recent study that maximizes the class separability H551 
of samples based on Fisher kernel, which is developed 
with different context and different criteria from ours. 

Following [|14j, we only consider © as parameters and 
predefine a. Let £(x|0) = log(p(x|©)) denote the log- 
likelihood. Our Fisher vector for each sample x is 

$ 0 (x) = [V 01 £(x|© 1 ) t , V© 2 £(x| 0 2 ) t ] t , 

where ©i and © 2 are the parameters of the SGBNs 
for the two classes (y = 1,2), respectively. Recall that, 
using a BN, the probability p(x|@) can be factorized as 

2. An exception (U is discussed in "Generalization" in Section 133] 
which is published after our work [25l . 


p(x|0) = J~[ p{xi\PsLi. Oi). Therefore, for GBN it can 

i= I,--- ,m 

be shown that 


£(x|0) = y^logp^lPa^flj) 
2—1 



(3.1) 


log(2? T^m). 


Taking partial derivative over 0;, we have 


<9£(x|0) 

de t 


Pa;Pa, a x,Pa, 

o ” i n 


(3.2) 


= S(a ',i)6i + s 0 (a:,;), 


where S (x,) is a squared matrix and s (J (x,) is a vector. As 
shown, both S(x;) and s 0 (x.;) are constant with respect to 
0. Therefore, the Fisher vector $©(x) is a linear function 
of 0. This simple form of $©(x) significantly facilitates 
our further kernel learning. 


3.1.2 Discriminative Fisher kernel learning via SVM 

As each Fisher vector is a function of the SGBN parame¬ 
ters, discriminatively learning these parameters can thus 
be converted to learning discriminative Fisher kernels. 
We require that the learned SGBN models possess the 
following properties. Firstly, the Fisher vectors induced 
by the learned SGBN model should be well separated 
between classes. Secondly, the learned SGBN models 
should maintain reasonable capacity of representation. 
Thirdly, the learned SGBN models should not violate 
DAG. 

We use the following strategies to achieve our goal. 
Firstly, to obtain a discriminative Fisher kernel, we 
jointly learn the parameters of SGBN and the separat¬ 
ing hyper-plane of SVMs with Fisher kernel. Radius- 
margin bound, the upper bound of the Leave-One-Out 
error, is minimized to keep good generalization of the 
SVMs. Secondly, to maintain reasonable representation, 
we explicitly control the fitting (regression) errors of 
the learned model during optimization. Recall that GBN 
learns the network by minimizing the regression errors 
of each node over its parent nodes. Thirdly, we enforce 
the DAG constraint to ensure the validity of the graph. 
Our method is developed as follows. 

In order to use radius-margin bound, £ 2 -SVM with 
soft margin is be employed IfSlfl which optimizes 

min i||w||2 + C£ T £ (3.3) 

w,£ Z 

s.t. y i (w T $(x;) T b) > 1 - & > 0, Vi. 

Following the convention in SVMs, w is the normal of 
the separating plane, b the bias term, £ the slack variables 
and C the regularization parameter. Here y, is the class 
label of the i-th sample. £ 2 -SVM can be rewritten as 
SVM with hard margin by slightly modifying the kernel 

3. Radius-margin bound is rooted in hard-margin SVM. £ 2 -SVM 
with soft-margin can be rewritten as SVM with hard margin. 




















6 


K := K + I/C, where I is identity matrix. For conve¬ 
nience, in the following, we redefine w := [w T \/~C£, | 1 
and $(xj) := [$ T (x.j) e i r y i /\/C] T . The vector e t has the 
value of 1 at the i-th element, and 0 elsewhere. 

Incorporating radius information leads to solving 

min if? 2 ||w||! (3.4) 

W Z 

s.t. j/ J (w T $(x i ) + b) > 1, Vi, 

where R 2 denotes the radius of Minimal Enclosing Ball 
(MEB). It has been observed that when the sample size is 
small, the estimation of R 2 may become noisy and unsta¬ 
ble (35). Therefore, it has been proposed to use the trace 
of the total scatter matrix instead for such cases l35l , l36l . 
We finally solve the following optimization problem: 

min (Ur(S T )||w||| (3.5) 

6, w Z 

s.t. y l (w T $©(x i ) + b) > 1, Vi 

MX,,©!) < T u h(X 2 ,0 2 )<T 2 , 

©i G DAG, 0 2 G DAG. 

Here (i/St) is the trace of the total scatter matrix St, 
where St = — m)(<E>(xj) — rn) T , and m is 

the mean of total n samples in the kernel-induced space. 
It can be shown that tr(ST) = tr(K) — 1 1 Kl/n, where 
1 denotes a vector whose elements are all 1, and K 
the kernel matrix. Fisher vector 4>©(xi) is obtained as 
in Section T3. 1.1 1 The function h (■) measures the squared 
fitting errors of the corresponding SGBNs for the data 
Xi and X 2 from the two classes. It is defined as 

m 

MX, ©) = Y, l|x : ,i -PATe^. (3.6) 

i =1 

The two user-defined parameters T\ and T 2 explicitly 
control the degree of fitting during the learning. Adding 
these constraints also avoids the scaling problem of 0. 

The DAG constraint in H-SGBN could be employed 
to enforce the validity of the graph. However, here we 
adopt a new DAG constraint proposed in Section [4] due 
to its advantages over that of H-SGBN. The new DAG 
constraint employs a set of topological ordering variables 
(o, T) to guarantee DAG. It is a bilinear function of the 
ordering variables (o, T) and the SGBN parameters 0. 
An elaboration is given in Section [4] At the moment, let 
us temporarily skip the details of this DAG constraint 
and concentrate on the discriminative learning. 

One possible approach for solving Eqn. d3.5b is to 
alternately optimize the separating hyperplane w and 
the parameter 0. That is, 

min J (0) (3.7) 

©,o, Y 

s.t. /i(Xi,©i)<Ti, h(X 2 ,0 2 ) < T 2 , 

0i G DAG(o\, T x ), 0 2 G DAG{ o 2) T 2 ). 


Algorithm 1 KL-SGBN: Discriminative Learning 

Input: data Xi, X 2 G R" xm , label y G R raxl 
Denote 0 = [0i,0 2 ] 

Initialize © (0 \ o l(J) . T' 0,1 by Algorithm [3] for each class. 

Let ©b" 1 ) = 0 (o) , oV- 1 ) = 0 (°), Y (t_1) = Y (0) 

repeat 

1. Compute <Fq 11 and Kg 11 by Eqn. (13.2b 

2. Compute tr(ST) (t_1) = tr(K^ _1) ) - l T K^ _1) l/n 

3. Solve Jo(© (t_1) ) and a* by Eqn. (13.9b 

4. J(0 (t “ 1) ) = Jo(0 (t_1) ) x tr(S T ) (t_1) 

6. Minimize Eqn. ( 13.7b with a* and obtain 0 ! 1 : 

6.1 Let o = = Y^ f_1 \ solve 0 (t ^ by 

Eqn. (13.7b : 

6.2 Let © = 0 (t) , solve o^\ Y (t) by Eqn. d4~2t . 

7. Let 0 (t_1) = © w , = oW, Y^~ 1} = Y (t) 

until convergence/max number of iterations 

Output: 0* = ©(‘) 


where 

J(0) = min itr(S T )||w|| 2 (3.8) 

w Z 

s.t. 2 / l (w T 'F©(x i ) + 6) > 1, Vi. 

Note that for a given 0, the term tr(Sr) is constant in 
Eqn. (13.8b . Due to the strong duality in SVM optimiza¬ 
tion, we solve the term IMI! b y 

n \ n n 

J o (0) = max yiUjaiajK® (xj, Xj) 

i= 1 2— 1 7 = 1 

(3.9) 

n 

s.t. x ^2 / a i y i = 0, OLi > 0 Vz, 

2=1 

where a,; is the Lagrangian multiplier and JT© (x,, x ; ) = 

($©(x i ),$©(x J )). 

As mentioned above, the DAG constraint is a bilinear 
function of (o, Y) and 0. Many quadratic program¬ 
ming packages could be used to solve Eqn. (13.7b . We 
use fmincon-SQP (sequential quadratic programming) in 
Matlab. Gradient information is required by many opti¬ 
mization algorithms (including fmincon-SQP) to speed 
up the line search. It is not difficult to find that the 
gradient of A'©(xi, x^) is just a linear function of 0, mak¬ 
ing the evaluation of gradient V© J easy. Our learning 
process is summarized in Algorithm |T| 

3.2 Proposed Max-margin-based Discriminative 
Learning (MM-SGBN) 

KL-SGBN introduces group discrimination into SGBNs 
by optimizing the performance of SVM classifiers with 
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SGBN-induced features. Although this leads to a rela¬ 
tively simple optimization problem, optimizing the per¬ 
formance of SVMs does not necessarily imply optimiz¬ 
ing the discrimination of SGBNs. We believe that, the 
discrimination of SGBNs can be further improved if 
we directly optimize their (instead of SVMs') classifica¬ 
tion performance. Therefore we propose a new learning 
framework based on max-margin formulation directly 
built on SGBNs. We call this method MM-SGBN. 

For binary classification, maximizing the minimum 
margin between two classes can be obtained by 
maximizing the minimum conditional likelihood ratio 
(MCLR) lllfl : 


MCLR(0) 


1 P(lli | x i) 

1 P{yi\*i,®yi) 


Without loss of generality, y ? ; and y, G {—1,1}/ repre¬ 
senting the true and false labels for the i-th sample, 
respectively. The parameter ® Vi = @i if y, = 1, or 
©j/i = ©2 if y, = —1. We can see that MCLR identifies 
the most confusing sample whose probability of the true 
class assignment is close to or even less than that of the 
false class assignment. Hence, maximizing MCLR targets 
the maximal separation of the most confusing samples 
in the two classes. It is not difficult to see that MCLR 
can naturally handle multi-class case when replacing the 
denominator by the maximal probability induced by all 
false class assignments. Let © = [©i,© 2 ]. Taking log- 
likelihood of MCLR, we have 


log MCLR(0) 

n 

= min (log p(x.i\yi, ® Vi ) - log p(xi\yi, ©yj) + const, 

(3.10) 

where the prior probabilities of P(y^) and P(y,) that 
are irrelevant to © are absorbed into the constant term. 
Eqn. (13.10b can be shown to be a quadratic function of 
© in the case of SGBN. In order to maximize MCLR, 
we require the difference of log-likelihood function in 
Eqn. (13.101 be larger than a margin for all samples, r, and 
maximize the margin r. To deal with hard separations, 
we employ a soft margin formulation as follows. 


n 

min AV G — r (3.11) 

©l,02,£i,r,O,T ' 

2—1 

s.t. yi (£(©i,Xj) - £(© 2 ,Xi)) > r -V* J3.11h ) 

ii > 0, r > 0, @33J>) 

h(X 1; ©x) < Ti, h(X 2 ,@ 2 )<T 2 EHt) 


©i G DAG(o u Ti), © 2 G DAG{ o 2 , X 2 ) EITh ) 

The constraints in (13. 1 lh l enforce the likelihood of x, to 
its true class larger than that to its false class by a margin 
r. The variables G are slack variables indicating the in¬ 
trusion of the margin. The function £(•) denotes the log- 
likelihood, defined in Eqn. (I3.lt . We require £(©i,Xj) 
larger than £(© 2 ,Xj) when y, = 1, and £(© 2 , x,) larger 
than £(©i,Xj) when yi = —1. 


Algorithm 2 MM-SGBN: Discriminative Learning 

Input: data Xi, X 2 G R" xm , label y G R raxl 
Denote 0 = [@i,0 2 ] 

Initialize o l(J) . T' 0,1 by Algorithm [3] for each class. 

Fix © = 0' :l,; and estimate r*- 0 -* and £- IJI by Eqn. d3.11t 

only with the two constraints (I3.11h l and 13.1 I b l. 
Initialize t = 1. 

repeat 

Step 1: Fixing o = o^ -1 ^ and T = optimize 

Eqn. Il3.11t with the constraints (13. 1 lh ~ 13.11 L ) to 
update 0^, rP and e^; 

Step 2: Fixing 0' 1! , optimize Eqn. 14.2b to update o !t> 
and T® to enforce DAG. 

Let t = t + 1 

until convergence/max number of iterations 

Output: 0* = 0 (t) 


The constraints in 13.lib ) control the fitting errors, 
same to that used in KL-SGBN, and the function h(-) 
is defined in Eqn. 13.6b . 

The constraints in 13.lid ) are the DAG constraint 
proposed in Section |4) Eqn. 14.1b . To enforce the validity 
of DAG on both graphs, we introduce a set of order 
variables o = {oi, o 2 , • • • , o m } and T for each class sep¬ 
arately, and employ the constraints stated in Eqn. 14.1b . 
Please refer to Eqn. 14.1b for details. 

The optimization in Eqn. 13.11b can be solved iter¬ 
atively by optimizing and (o, T) alternately, 

as summarized in Algorithm |2] In Step 1, we solve a 
linear objective function with n non-convex and two 
convex quadratic constraints by fmincon-SQP (sequen¬ 
tial quadratic programming) in Matlab. In Step 2, we 
solve the linear programming by the package of CV)fl 

It is worthy noting that, we learn an SGBN model for 
each individual class in order to meet the requirement 
of both interpretation and discrimination in exploratory 
research. For example, each SGBN may model the brain 
network of the healthy or the diseased class, as well 
as carrying the essential class discrimination. Both the 
network modelling and the discrimination are of interest 
in such cases. Our method is different from the con¬ 
ventional BN classifers m, Ha, m that solely focus 
on classification. In those methods, only a single BN is 
learned to reflect the "difference" of the two classes. It 
does not model any individual class as our method does, 
and hence deviates from our purpose of both represent¬ 
ing and discriminating brain networks. Moreover, the 
works in ED, 02, CD cannot handle the continuous 
variables of brain imaging measures, and inherit the 
drawbacks of the traditional two-stage methods. 


4. http://cvxr.com/cvx/ 






































3.3 Discussion and Analysis 

In the following, some issues regarding the two pro¬ 
posed discriminative learning frameworks are discussed. 

Classifiers. The proposed discriminative learning 
frameworks produce a set of jointly learned SGBN mod¬ 
els, one for each class. Based on these SGBN models, 
two kinds of classifiers can be constructed, i.e., the SGBN 
classifier and the SVM classifier. The SGBN classifier cat¬ 
egorizes a sample by comparing the sample's likelihood 
according to each SGBN model. The SVM classifier is 
trained by the sample-specific Fisher vectors induced 
from the SGBN models. These two classifiers are tightly 
coupled by the underlying SGBN models. Specifically, 
more discriminative SGBN models directly lead to a 
better SGBN classifier, and can provide discriminative 
Fisher vectors to SVM for better classification. Rooted 
in this relationship, both the KL-SGBN and the MM- 
SGBN can improve the classification performance of 
these two classifiers simultaneously. Put simply, KL- 
SGBN explicitly optimizes the SVM classifier and in 
turn implicitly improves the SGBN classifier; while MM- 
SGBN explicitly optimizes the SGBN classifier, bringing 
an implicit improvement of the SVM classifier as well. 
When evaluating the discriminative power of the learned 
SGBN models by the SGBN classifier (a direct mea¬ 
surement), it is therefore expected that MM-SGBN can 
outperform KL-SGBN. Flowever, KL-SGBN has some 
computational advantages and provides a new perspec¬ 
tive to manipulate BN models, analyzed as follows. 

Computational Issues. Compared with KL-SGBN, 
MM-SGBN requires to solve more complicated optimiza¬ 
tion problems, which may become problematic when 
the number of training samples increase. Let us com¬ 
pare Eqn. (37} for KL-SGBN and Eqn. (3Tfl for MM- 
SGBN. For KL-SGBN, Eqn. (13.71 1 optimizes </(©) with 
two convex quadratic constraints of data fitting and two 
DAG constraints, which are independent of the number 
of training samples n. The evaluation of J(©) needs 
to solve an SVM-like problem in Eqn. (13.8b . taking just 
n linear constraints of ©, which could be efficiently 
solved by off-the-shelf SVM packages. For MM-SGBN, 
in addition to the data fitting and DAG constraints as in 
Eqn. (37}, the optimization problem in Eqn. (13.11b also 
has to satisfy n non-convex quadratic constraints. When 
n increases to a medium or large value, the optimization 
problem could be quite hard to solve. 

Edge Selection. In addition to the discriminative 
learning of SGBN, the employment of Fisher kernel in 
KL-SGBN also provides a new perspective of edge selec¬ 
tion for GBN. As introduced in Section 13.1.11 applying 
Fisher kernel on GBN produces sample-specific feature 
vectors whose component is the gradient of the log 
likelihood, i.e., . In other words, each feature 

now corresponds to an edge ©y in the SGBN. This 
makes it possible to convert the SGBN edge selection 
to a more traditional feature selection problem that has 
been well studied and has a large body of options in 


the literature. Edge selection has been employed in our 
work to deal with the "small sample size" problem 
that is often encountered in medical applications. For 
example, it is common to have only 100 training samples 
but 3200 parameters (for SGBNs of 40 nodes from two 
classes) to learn in brain network analysis. To handle 
this issue, we keep using the whole © for computing 
K®, but only optimize a selected subset © s . There are 
many options to determine © s . We just compute the 
Fisher vector T>© for each sample, calculate the Pearson 
correlation between each component of $© and the class 
labels on the training data, and select the top 0, with 
the highest correlations. To keep our problem simple, 
only the parameters associated with edges present in 
the graph are optimized to avoid the violation of DAG. 
It is remarkable that even this simple selection process 
has significantly improved the discrimination for both 
KL-SGBN and MM-SGBN. Note that this edge selection 
step is essentially different from that of the traditional 
two-stage methods. It is just an empirical method to 
handle the small sample size problem and will become 
unnecessary when sufficient training data are available. 
In contrast, identifying the candidate-parent sets is an 
indispensable step in two-stage methods to obtain com¬ 
putationally tractable solutions. 

Generalization. We would like to point out that our 
learning framework of KL-SGBN could be easily gen¬ 
eralized. It could be used to discriminatively learn the 
parameters of distributions other than that represented 
by GBN by just simply switching GBN to the target 
distribution, such as Gaussian Mixture Model (GMM). 
Indeed, this has been seen in Il32l. after our work [|25J. 
Flowever, as shown in this paper, the Fisher vector 
of GBN is a linear function of the model parameters, 
which significantly simplifies the learning problem. This 
favorable property may not be guaranteed with other 
distributions, including GMM. 

4 Proposed DAG Constraint 

In this section, we revisit H-SGBN and propose a new 
DAG constraint that could simplify the optimization 
problems in SGBN and its discriminative learning pro¬ 
cess as introduced in Sections 13.11 and 13.21 

4.1 H-SGBN Revisited 

Recall that, the DAG constraint in H-SGBN (Section 12.1b 
utilizes the matrix P, an implicit function of 0, which 
significantly complicates the optimization problem in 
Eqn. (12.lb . In Iil4ij . for simplicity, in each optimization 
iteration, P is first treated as a constant while optimizing 
©, and then recalculated by searching on the binarized 
new ©. This hard binarization could introduce high 
discontinuity of © into the optimization. Solving 0 
column-wisely by BCD may mitigate this problem since 
only one column of © is changed in each iteration, 
inducing less discontinuity. However, we observe that 
the solution of BCD depends on which column of 0 
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to be optimized first. In other words, if we randomly 
permute the ordering of features (the columns in X), 
we will obtain different SGBNs, which impairs the 
interpretability of the SGBN model. The optimization 
ordering matters because the matrix P used in the DAG 
constraint changes with the ordering. This problem has 
been demonstrated in our experiment. Moreover, we find 
experimentally that if P is solved as a whole instead of 
BCD, the optimization in Eqn. (12.Il l will not converge 
but oscillate between some non-DAG solutions, possibly 
due to the high discontinuity mentioned above q Early 
stop cannot help because no premature solution satisfies 
DAG. These optimization difficulties motivate our work 
of proposing a new DAG constraint that is much simpler 
for SGBN, as described below. 

4.2 Proposed DAG constraint 

It is known that, a BN is equivalent to a topological 
ordering (Page 362 in 1371 ). Therefore, we propose a 
new DAG constraint applicable to continuous variables 
with GBN based on this equivalence. With a few linear 
inequalities and variables separable from 0, the new 
DAG constraint significantly simplifies that used in [14]. 
Specifically, given a directed graph Q and the parameters 
0, a real-valued order variable Oi is assigned to each 
node i, where 0 < Oi < A, and A is a predefined arbitrary 
positive number. We propose a sufficient and necessary 
condition for Q to be DAG as in Proposition 1. 

Proposition 1. Given a sparse Gaussian Bayesian Net¬ 
work parameterized by 0 and its associated directed 
graph Q with m nodes, the graph Q is DAG if and only if 
there exist some Oi (i = 1, • • • , to) and Y € R mxm , such 
that for arbitrary A > 0, the following constraints are 
satisfied: 

(4.1) 

Oj - Oi > — - T i:j , Vi,j e {1, • • • ,m}, i^j (Oh) 
m 

Yy > 0, Oh) 

Yy X 0y = 0, Ot) 

A > Oi > 0. Qh ) 

Eqn. (l4.1b leads to a topological ordering equivalent to 
DAG. The topological ordering means that if node j 
comes after node i in the ordering (Oj > Oi ), there 
cannot be a link from node j to node i, which guarantees 
the acyclicity. The proof of Proposition 1 is given in 
Appendix. 

By Proposition 1, we remove the awkward hard bi- 
narization for computing P in fT4j. The inequalities of 
(14.lh . 14.1b . 14.1 H ) are linear to the ordering variables o, 
and T. The equation d4.H i') differs from the equation 
&j i x P ij = 0 in m in that the variable Yy is now 
separable from ©y (while Py is not) and does not 
require the binarization of 0. This makes it tractable to 

5. Please note that, solving © column-wisely without updating P in 
each iteration will only lead to non-DAG solutions 


solve © as a whole instead of BCD (to avoid the feature 
ordering problem). 

It is worth noting that, provided 0 is sparse, the 
number of constraints in Eqn. (14.1b could be significantly 
reduced. As can be seen, for any ©y = 0, as long as 
we set the corresponding Yy an arbitary value greater 
than (d- + 1)A, all the conditions in Eqn. (14.111 will 
be automatically satisfied. Therefore, we only need to 
consider the constraints related to 0y =/= 0. 

The idea of topological ordering is also used to de¬ 
sign DAG constraint for the discrete variables in (38l . 
However, the work in 1381 addresses the multinominal 
distribution of discrete variables, while here we target 
the Gaussian distribution of continuous variables. It is 
worthy noting that the constraint in lf38l has to predefine 
candidate parent-node sets. Therefore, it inherits the 
drawbacks of the two-stage methods as pointed out in 
Section [T] This has been circumvented in our proposed 
DAG constraint for SGBN. 

4.3 Estimation of SGBN from A Single Class 

With our DAG constraint proposed in Eqn. d4.H l. we 
could estimate SGBN from a single class as the initial 
solution to our discriminative learning of KL-SGBN or 
MM-SGBN. In particular, we optimize 

m 

min V||x : , i -PA7e i ||2 + A 1 ||0 i || 1 + A dos e i T |0 i | (4.2) 

©,o.Y z —' 

i=1 

s.t. Oj -Oi > — - Yy,Vi,j e {1, - - • ,to}, i^j 

TO 

0 < Oi < A, Yy > 0, 

where e, is the i-th column of the matrix Y, and d,| the 
component-wise absolute value of 0 ,. This optimization 
problem is solved in an iterative way with two alternate 
steps in each iteration: i) optimize o and Y (with 0 fixed) 
and ii) optimize © (with o and Y fixed). This process 
is repeated until convergence. We call this proposed 
method OR-SGBN (Algorithm [3]l. 

When the coefficient A dog is sufficiently large, the alter¬ 
nate optimization strategy of Eqn. d4.2b will converge to a 
DAG solution, as shown in Proposition 2 in Appendix. 
In practice, for numerical stability, we adopt a "warm 
start" strategy as in Ifl4l . that is, to gradually increase 
the values of A dag until the resulting Q becomes DAG. 
Specifically, we use a set of values of A dag'- A ^ < A^ < 
•• • < to solve Eqn. d4.2t (Algorithm [3j. 

We use a bias variable xq = 1 in the regression model 
to improve data fitting, thus a= \0, do] 1 jPa, 1] + e, 
(i > 1). In the following part, we denote 0, = [Oi d 0 ] 
and Pa, = [Pa, 1], The bias term 0 o is learned together 
with other 6, . This equals to introducing a bias node into 
the graph. It has no parent but is the parent of all the 
other nodes. If the original graph is a DAG, this does 
not cause the violation of DAG. 

It is interesting yet challenging to analyze the network 
consistency of OR-SGBN. It is noted that Eqn. (14.211 
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Algorithm 3 OR-SGBN: SGBN from a single class 
Input: data X <E R” xm 

Initialize ©' ()l by least square fitting. 

Initialize o^ and Y ' 01 by solving Eqn. (14.21 with 0 = 

0 (o) . 

Let T = 1. 

repeat 

Fixing Y = Y( t_1 ) and o = o^ T-1 ). 

Let t = 1, 0 (T-i,t=o) = ©(t-i)_ 

for \ dag = A« to do 

Optimize Eqn. d4.2b with the initial solution 

©(T-i.t-i) to obtain 0(r-i,t) > 

Let t=t+l. 

end for 

Let 0 (T) = ©( T - 1 ' M ). 

Fixing 0 ' 7 ; , optimize Eqn. (14.21) to update o !T; and 
Y !T ) to enforce DAG. 

Let T = T + 1. 

until convergence/max number of iterations 

Output: ©* = © (T) 


can be reorganized into a weighted LASSO problem, 
which can be conceptually linked to "adaptive LASSO" 
in the literature 11391 . 1401, |4l1 . The analysis frame¬ 
work provided by these works is suggestive of promis¬ 
ing strategies to analyze the network consistency for 
Ll-penalized Gaussian networks. However, a complete 
treatment of this analysis for OR-SGBN requires a deep 
investigation. Considering the significant amount of the 
required workload and its importance, we will explore 
this problem in a separate paper in our future work. 

5 Experiment 

In this section, we investigate the properties of our pro¬ 
posed methods from three aspects: the DAG constraint, 
the discriminative learning process, and the resulting 
connectivity for brain network analysis. Four experi¬ 
ments are conducted, summarized in Table [2] The data 
sets and the experiments are elaborated as follows. 

5.1 Neuroimaging Data Sets 

We conduct our experiment on the publicly accessible 
ADNI |42j database to analyze brain effective connec¬ 
tivity for the Alzheimer's disease. Three data sets are 
used from two imaging modalities of MRI and FDG-PET 
downloaded from ADNI. They are elaborated as follows. 
MRI data set includes 120 Tl-weighted MR images be¬ 
longing to 50 mild cognitive impairment (MCI) patients 
and 70 normal controls (NC). These images are prepro¬ 
cessed by the typical procedure of intensity correction, 
skull stripping, and cerebellum removal. We segment the 
images into gray matter (GM), white matter (WM), and 


cerebrospinal fluid (CSF) using the standard FSL@ pack¬ 
age, and parcellate them into 93 Region of Interest (ROI) 
based on an ROI atlas |43l after spatial normalization. 
The GM volume of each ROI is used as the imaging fea¬ 
ture to characterize each network node. Forty ROIs are 
included in this study, following fl4j . They have higher 
correlation with the disease and are mainly located in 
the temporal lobe and subcortical region. Studying brain 
morphology as a network can take the advantage of 
statistical tools from graph theory. Moreover, it has been 
reported that the covariation of gray matter morphology 
might be related to the anatomical connectivity |44J. 
PET data set includes 103 FDG-PET images (and their 
corresponding MR images) of 51 AD patients and 52 
NC. The MR images belonging to different subjects are 
co-registered and partitioned into ROIs as before. The 
ROI partitions are copied onto their corresponding PET 
images by a rigid transformation. The average tracer 
uptakes within each ROI is used as the imaging feature 
to characterize each network node. Forty ROIs discrimi¬ 
native to the disease are used in the study. The retention 
of tracer in FDG-PET is analogous to the glucose uptake, 
thus reflecting the tissue metabolic activity. 

MRI-II data set is similar to the MRI data set but 
using 40 different ROIs covering the typical brain regions 
spread over the frontal, parietal, occipital and temporal 
lobes. 

We randomly partition each data set into 30 groups of 
training-test pairs. Each group includes 80 training and 
40 test samples in MRI and MRI-II, or 60 training and 
43 test samples in PET. 

5.2 DAG Constraint 

With our proposed DAG constraint, the SGBN model 
for an individual class can be learned with all the 
parameters © optimized together (OR-SGBN), instead 
of column-wisely as did in fill , 1251 . 1251 . To explore 
the properties of our DAG constraint, we test three ex¬ 
perimental configurations, namely, OR-SGBN (WHOLE), 
H-SGBN (BCD) and H-SGBN (WHOLE). The word in 
the parenthesis is used to explicitly indicate whether 
the parameters © are optimized together (WHOLE) or 
column-wisely (BCD). OR-SGBN (WHOLE) is our SGBN 
learning method for a single class in Algorithm [3l im¬ 
plemented with the package of CVX. H-SGBN (BCD) 
is the column-wise method in DU and implemented 
with the code downloaded from the authors' website. H- 
SGBN (WHOLE) is our attempt to optimize © together 
for the objective function of H-SGBN in (14|. which is 
implemented with the package of CVA0. The same © 
that is computed by a sparse least square fitting of the 
training set is provided to all the methods to initialize 
the optimizations. The "warm-start" strategy is applied 
wherever applicable in all methods. 

6. http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/ 

7. The optimization problem is solved by a series of convex sub¬ 
problems. 





TABLE 2 

Summary of Experiment Purpose 
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Experiment 

Test Subject 

Purpose 

Exp-I (Sec. 15.21 
Exp-II (Sec. IT2l 
Exp-III (Sec. ED 
Exp-TV (Sec. [54} 

DAG constraint 

DAG constraint 

discriminative learning 
brain network analysis 

Test the invariance of solution to feature ordering 

Test the ability of network structure recovery 

Test the improvement of discriminative power of SGBN models 
Investigate the learned brain connectivity patterns 


It is found that when solving all 0 as a whole, H- 
SGBN (WHOLE) that uses the DAG constraint in fl4| 
does not converge: the optimization is trapped to oscil¬ 
late between a few solutions that are not DAG. There¬ 
fore, from now on, we only consider H-SGBN (BCD) and 
OR-SGBN (WHOLE). 

Exp-I. In this experiment, we compare the solutions 
of OR-SGBN (WHOLE) and H-SGBN (BCD) with re¬ 
spect to the change of feature ordering. To do that, 
for the neuroimaging data sets, we randomly permute 
the feature ordering for 100 times. The estimated © 
of the resulting 100 SGBNs are re-arranged according 
to the initial feature ordering and then averaged as in 
Fig. [2] As shown, the averaged result from OR-SGBN 
(WHOLE) (Fig. |2 (d)) is almost identical to the result 
using the original feature ordering (Fig. |2 (c)), reflecting 
its robustness to feature ordering. In contrast, H-SGBN 
(BCD) generates SGBNs with large variations when the 
feature ordering changes ((Fig. [2(a) versus (b)). To give 
a quantitative evaluation, the Euclidean distance and the 
correlation between the averaged © and the original 
© are presented in Table |2 Consistently, the solutions 
from OR-SGBN (WHOLE) are much less affected by 
the ordering permutation, indicating the advantage of 
solving © as a whole via the proposed DAG constraint. 

TABLE 3 

Quantitative Analysis of © for the random 
permutation of feature ordering (between the original 
and the averaged ®) 



Distance 

Correlation (R) 

OR-SGBN 

©i 

0.08 

0.9996 

(WHOLE) 

©2 

0.18 

0.9981 

H-SGBN 

©1 

1.91 

0.6828 

(BCD) 

©2 

2.06 

0.6396 


Exp-II. In this experiment, we test the ability of OR- 
SGBN (WHOLE) at identifying network structures from 
data. Since no ground-truth is available for the three 
neuroimaging data sets due to the unknown mecha¬ 
nism of the disease, we conduct experiments on nine 
benchmark network data sets mostly coming from the 
Bayesian Network Repository |45l as was done in the 
literature fl~2l . |46| . The nine benchmark data sets are: 
Factors (27 nodes, 68 arcs). Alarm (37 nodes, 46 arcs). 


Original 


0.8 

0.6 



- 0.2 


- 0 . 4 1 -‘-‘-‘— 

0 500 1000 1500 


After Permutation 



(a) H-SGBN (BCD) 


(b) H-SGBN (BCD) 


Original 



(c) OR-SGBN (WHOLE) 


After Permutation 



(d) OR-SGBN (WHOLE) 


Fig. 2. One example of the estimated parameter © for the 
MCI class (reshaped as a long vector) with regard to the 
random permutation of the feature ordering. Quantitative 
measurements of the changes are given in Table\3\ 


Barley (48 nodes, 84 arcs), Carpo (61 nodes, 74 arcs). 
Chain (7 nodes, 6 arcs), Hailfinder (56 nodes, 66 arcs). 
Insurance (27 nodes, 52 arcs). Mildew (35 nodes, 46 arcs) 
and Water (32 nodes, 66 arcs). We compare the OR-SGBN 
(WHOLE) with another eight BN learning methods, 
including LIMB (121 , GS ||47| . TC and its variant TC- 
bw Ifl3l and three variants of IAMB |[48l . The experiment 
is repeated for 50 simulations. In each simulation, for 
each network, we randomly sample 1000 samples from 
±Uniform(0.5,1) for the regression coefficients of each 
variable on its parents. The parameters of the eight 
methods to be compared are set according to fl4]. A 
predefined A that controls the sparsity of OR-SGBN 
is uniformly applied to all the nine data sets, which 
simply brings the number of the resulting edges to a 
reasonable range @. We use the first stage estimate of 
LIMB as the initial solution of OR-SGBN. Table 12 shows 
the total numbers of mis-identified edges (including both 
the false and the missing edges), while Table 0 shows 
the numbers of falsely identified edges (false positive). 

8. The Bayesian Information Criterion is used to select A in |14| . 
However, it did not behave well in our experiment. 
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In addition. Table [6] lists the numbers of falsely identi¬ 
fied PDAG structures. PDAG structures are statistically 
indistinguishable structures, i.e., representing the same 
statistical dependency. The PDAG of BN is obtained 
by the method in 1391 . From Tables [4] ~ [6j it can be 
seen that OR-SGBN shows significantly smaller errors on 
six data sets (Factors, Alarm, Barley, Carpo, Hailfinder 
and Insurance) in identifying both edges and PDAG 
structures. For the data sets of Mildew and Water, OR- 
SGBN performs similarly to the other methods. It only 
performs relatively inferior on Chain. This experiment 
demonstrates that the proposed DAG constraint for 
SGBN can perform effectively for BN structure identi¬ 
fication. Its relatively low risk of mis-edge identification 
is a favorable property for exploratory research. 

5.3 Comparison of Discrimination 

After testing the effectiveness of the proposed DAG 
constraint, we now investigate the theme of this paper: 
the discriminative learning frameworks. We consider 
two kinds of classifiers: i) the SGBN classifier (with 
two SGBN models, one for each class), and ii) the SVM 
classifier learned by the Fisher vectors induced from the 
SGBN models. 

Exp-III. In this experiment, we test whether our learn¬ 
ing methods (KL-SGBN and MM-SGBN) can improve 
the discriminative power on both kinds of classifiers for 
the real neuroimaging data sets. The initial SGBN models 
are obtained by our proposed OR-SGBN (WHOLE), since 
it has been shown more robust to feature ordering than 
H-SGBN as above. For the SGBN classifier, assuming 
equal prior, we assign a test sample to the class with a 
higher likelihood. For the SVM classifier, we use d^-SVM 
with Fisher kernels. In order to maintain representation 
capability, we allow maximal 1% additional squared 
fitting errors (that is, T) = 1.01 x T)o, (* = 1,2), where 
Tm is the squared fitting error of the initial solution) to 
be introduced during the learning process of KL-SGBN 
or MM-SGBN. 

The test accuracies are averaged over the 30 randomly 
partitioned training-test groups. The classification per¬ 
formances of SGBN and SVM classifiers are evaluated 
with the varied parameter A that controls the sparsity 
level and the number of edges optimized in the learning 
process in Fig. [3] The results of our proposed KL-SGBN 
and MM-SGBN are plotted by the green and the red 
lines, respectively. The results of the individually learned 
OR-SGBN and H-SGBN are plotted by the blue and the 
black lines, respectively. The top two rows in Fig. [3] 
correspond to the results from the SGBN classifiers, 
while the bottom two rows correspond to those from 
the SVM classifiers. From Fig. [3j we have the following 
observations. 

i) Both KL-SGBN and MM-SGBN can significantly 
improve the discriminative power of the individually 
learned SGBNs (Fig. [3j the top two rows), as well as 
their associated SVM classifiers (Fig. [3j the third row). 


Such improvements are consistent over the three neu¬ 
roimaging data sets and different parameter settings, 
and could reach the significant increases of 10% ~ 20% 
on most occasions. When the network becomes more 
sparse, the classification performance of the individually 
learned SGBNs (H-SGBN and OR-SGBN) drops signifi¬ 
cantly possibly due to the insufficient modeling of data. 
However, under such circumstances, KL-SGBN and MM- 
SGBN can still maintain high classification accuracies, 
which may indicate the necessity and effectiveness of 
the discriminative learning in classification scenarios. 

ii) When using SGBN classifiers, for all the three 
data sets, MM-SGBN consistently achieves higher test 
accuracies at all sparsity levels (Fig. |3j the first row) 
with different numbers of optimized edges than KL- 
SGBN (Fig. [3j the second row). The advantage of MM- 
SGBN over KL-SGBN comes from explicitly optimizing 
the discriminative power of SGBNs, instead of getting 
help from optimizing the performance of SVM on SGBN- 
induced features. 

iii) When using SVM classifiers, the SVM built upon 
KL-SGBN-induced features performs slightly better than 
that built upon MM-SGBN-induced features at all spar¬ 
sity levels (Fig. [3j the third row). This is expected since 
KL-SGBN optimizes the performance of its associated 
SVM classifier. 

iv) When cross-referencing the first and the third rows 
in Fig. [3j it is noticed that SVM classifiers in general 
perform worse than the discriminatively learned SGBN 
classifiers. These may be because our Fisher vectors have 
very high dimensionality, which causes serious overfit¬ 
ting of data in SVM classifiers. Such situation might be 
somewhat improved for SGBN-classifiers since the sim¬ 
ple Gaussian model may "regularize" the model fitting. 
Based on this assumption, we further select a number of 
leading features from Fisher vectors by computing the 
Pearson correlation of the features and the labels, and 
use the selected features to construct the Fisher kernel 
for the SVM classifiers. As shown in the fourth row 
of Fig. |3j the simple feature selection step can further 
significantly improve the classification performance of 
the Fisher-kernel based SVM. 

v) The individually learned OR-SGBN and H-SGBN 
perform similarly for classification. However, as men¬ 
tioned above, OR-SGBN has an additional advantage of 
being invariant to the feature ordering. 

vi) Recall that these improvement on discrimination 
are achieved with no more than 1% increase of squared 
fitting errors, which is explicitly controlled through the 
user defined parameters T) and T%. Note that the rate 
of 1% is application dependent. More tolerance of fitting 
errors can potentially bring more discrimination. 

5.4 Comparison of Connectivity 

We also conduct an investigation to gain some insights 
into the learned brain networks for the diseased and the 
healthy populations, respectively. 
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TABLE 4 

Total errors (number of both false and missing edges, averaged on 50 simulations) on benchmark networks 



LIMB 

GS 

TC-bw 

TC 

IAMB 

IAMBI 

IAMB2 

IAMB3 

OR-SGBN 

Factors 

101.48 

104.50 

102.90 

103.02 

103.14 

103.30 

103.14 

103.14 

54.82 

Alarm 

56.58 

59.30 

57.76 

60.60 

61.76 

59.16 

61.76 

61.76 

44.40 

Barley 

113.24 

114.70 

114.38 

122.78 

123.80 

109.92 

123.80 

123.80 

99.26 

Carpo 

125.74 

131.72 

131.18 

133.16 

132.76 

132.90 

132.76 

132.76 

25.58 

Chain 

5.32 

4.88 

5.50 

4.42 

4.70 

5.00 

4.70 

4.70 

7.04 

Hailfinder 

91.50 

94.94 

96.18 

99.02 

103.10 

92.74 

103.10 

103.10 

57.04 

Insurance 

74.78 

74.64 

73.74 

76.30 

78.78 

73.04 

78.78 

78.78 

59.04 

Mildew 

60.86 

60.74 

59.66 

63.80 

68.46 

92.74 

103.10 

103.10 

57.04 

Water 

92.86 

94.04 

90.24 

97.16 

102.70 

90.06 

102.70 

102.70 

93.08 


TABLE 5 

Number of falsely identified edges (averaged on 50 simulations) on benchmark networks 



LIMB 

GS 

TC-bw 

TC 

IAMB 

LAMB1 

IAMB2 

LAMB3 

OR-SGBN 

Factors 

47.66 

50.74 

49.40 

49.74 

50.28 

49.70 

50.28 

50.28 

17.70 

Alarm 

36.04 

37.72 

36.86 

39.24 

40.96 

37.30 

40.96 

40.96 

23.14 

Barley 

71.70 

72.30 

72.60 

80.76 

82.96 

69.76 

82.96 

82.96 

48.70 

Carpo 

71.96 

76.30 

75.14 

77.38 

77.18 

77.36 

77.18 

77.18 

14.56 

Chain 

2.66 

2.44 

2.76 

2.22 

2.36 

2.50 

2.36 

2.36 

3.52 

Hailfinder 

60.00 

62.04 

63.16 

65.42 

66.40 

60.90 

66.40 

66.40 

28.66 

Insurance 

42.80 

42.16 

41.72 

44.06 

48.08 

41.42 

48.08 

48.08 

31.20 

Mildew 

46.22 

46.46 

45.46 

49.82 

52.48 

44.82 

52.48 

52.48 

33.86 

Water 

64.52 

65.02 

63.70 

68.06 

74.22 

63.48 

74.22 

74.22 

46.74 


TABLE 6 

Number of falsely identified P-DAG structures (averaged on 50 simulations) on benchmark networks 



LIMB 

GS 

TC-bw 

TC 

IAMB 

IAMBI 

IAMB2 

IAMB3 

OR-SGBN 

Factors 

107.20 

109.54 

108.96 

108.84 

109.22 

108.84 

109.22 

109.22 

63.40 

Alarm 

61.74 

64.08 

62.54 

65.34 

66.82 

63.98 

66.82 

66.82 

51.02 

Barley 

120.54 

122.26 

121.38 

130.04 

131.24 

116.92 

131.24 

131.24 

105.50 

Carpo 

129.02 

135.34 

134.78 

136.92 

136.74 

136.22 

136.74 

136.74 

33.74 

Chain 

5.96 

5.54 

6.12 

5.16 

5.30 

5.66 

5.30 

5.30 

7.42 

Hailfinder 

103.72 

106.08 

107.56 

110.04 

113.44 

104.86 

113.44 

113.44 

63.76 

Insurance 

81.58 

81.68 

81.44 

83.70 

86.60 

80.66 

86.60 

86.60 

68.26 

Mildew 

61.68 

61.32 

60.34 

64.48 

69.30 

58.08 

69.30 

69.30 

67.24 

Water 

96.34 

97.46 

93.80 

100.38 

106.14 

93.60 

106.14 

106.14 

94.52 


Exp-IV. In this experiment, we visualize the learned 
brain networks and compare the connectivity patterns 
obtained by different methods and from different pop¬ 
ulations. MRI-II data set is used for this study since it 
covers regions spread over the four lobes of brain. 

The structures of the brain networks recovered from 
NC and MCI groups are displayed in Fig. 0] by using 
H-SGBN (BCD) and OR-SGBN (WHOLE), respectively. 
The network structure is obtained by thresholding the 
edge weights © with a cutoff value of 0.01 [14j. Each 
row i represents the effective connections (dark dots) 
starting from the node i, and each column j represents 
the effective connections ending at the node j. Note that, 
due to the different optimization problems involved, the 
same parameter A leads to different sparsity levels for 
H-SGBN and OR-SGBN. However, for a given method, 
different A values do not change the major structures of 
the resulting networks. 

In Fig. 01 it is noticed that H-SGBN (BCD) usually 
generates more connections in the upper triangle of the 
graphs even when we randomly permute the nodes. We 


suspect that this is caused by the column-wise optimiza¬ 
tion. The parameters 0, (corresponding to the columns 
in the graph) optimized at the early stage tend to be 
made more sparse than those optimized later in order 
to satisfy the DAG constraint. This phenomenon is not 
observed in OR-SGBN (WHOLE) that is used to initialize 
the discriminative learning. 

Let us focus on OR-SGBN. Compared with H-SGBN, 
OR-SGBN has an additional bias node corresponding to 
the last row and column in Fig. 0] Visualizing © can 
provide rich information for medical analysis. Here we 
just list a few observations as examples. With the same 
A, OR-SGBN produces 183 edges for NC, and 145 edges 
for MCI. Such loss of connectivity also happens at the 
temporal lobe (24% loss) for MCI. The temporal lobe 
(and some subcortical structures) is known to play a 
very important role in the progression of AD. The loss of 
connectivity in this region has been well-documented in 
wide AD-related studies (50) . (5D . |jl4). In Fig. 01 we also 
observe an increase of connectivity (the left bottom cor¬ 
ner in the figure) between the frontal and the temporal 
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Fig. 3. Comparison of classification accuracies on data sets of MRI (the left column), PET (the middle column) and 
MRI-II (the right column). The top two rows correspond to the test accuracies obtained by the learned SGBNs. The 
first row shows the test accuracies varied with the sparsity levels (i.e., the parameter X). The second row shows the 
test accuracies varied with the number of edges (denoted as “#Sei Edges” in the figure) optimized in discriminative 
learning. The bottom two rows correspond to the test accuracies obtained by SVMs using the SGBN-induced Fisher 
vectors either in full length (the third row) or with (100) selected components (the fourth row). 


lobe in MCI. Some study (52ll mentioned that the frontal 
lobe may have connectivity increase at the early stage 
of AD as a compensation of cognitive functions for the 
patients. Moreover, significant directionality changes are 
also found for the left (node 35) and the right (node 38) 


hippocampi, an important structure among the earliest 
ones affected by AD. Both hippocampi have reduced 
incoming connectivity (communications dominated by 
other nodes) but increased outgoing connectivity (com¬ 
munications dominated by themselves) in MCI. Please 
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note that the above observations could be influenced by 
the factors such as the limited number of data, the degree 
of disease progression and the imaging modality used in 
this study. More reliable medical analysis should be val¬ 
idated on larger data sets and worth further exploration, 
which is, however, beyond the scope of this paper. 




H-SGBN (BCD): NC H-SGBN (BCD): MCI 




OR-SGBN (WHOLE): NC OR-SGBN (WHOLE): MCI 

Fig. 4. Visualization of connectivities for MR I-11. The four 
red boxes correspond to the frontal, parietal, occipital and 
temporal (including subcortical regions) lobes of the brain. 
The green row (Row 35) and column (Col 35) correspond 
to the left hippocampus while the blue ones (Row 38 and 
Col 38) correspond to the right hippocampus. 

To illustrate the difference of edge weights learned by 
KL-SGBN and MM-SGBN, an example of 30 edge weight 
changes (from the initial OR-SGBN) learned by these two 
methods is given in Fig. [5j where the SGBN networks 
from the two classes are vectorized and concatenated as 
x-axis. As shown, the signs of weight changes are quite 
similar in both methods. The most significant difference 
is that, MM-SGBN gives negative weight changes to the 
bias node of the left Amygdala and the right Parahip- 
pocampus (red lines in Fig. [5]l while KL-SGBN gives 
them positive weight changes. The adjustment of edge 
weights leads to 10% increase of test accuracy for MM- 
SGBN in this example. 

6 Conclusion 

In this paper, we focus on the discriminative learning of 
Bayesian network for continuous variables, especially for 
neuroimaging data. Two discriminative learning frame¬ 
works are proposed to achieve this goal, i.e., KL-SGBN 
improves the performance of SVM classifiers based on 
SGBN-induced features, and MM-SGBN explicitly op¬ 
timizes an SGBN-based criterion for classification. We 
demonstrate how to utilize Fisher-kernel to bridge the 
generative methods of SGBN and the discriminative 
classifiers of SVM, and how to embed the max-margin 




(a) KL-SGBN (b) MM-SGBN 

Fig. 5. An example: change of edge weights learned by 
KL-SGBN and MM-SGBN 


criterion into SGBN for discriminative learning. The 
optimization problems are analyzed in details, and the 
advantages and disadvantages of the proposed meth¬ 
ods are discussed. Moreover, a new DAG constraint 
is proposed to ensure the validity of the graph with 
theoretical guarantee and validation on the benchmark 
data. We apply the proposed methods to modeling brain 
effective connectivity for early AD prediction. Significant 
improvements have been observed in the discriminative 
power of both the SGBN models and the associated 
SVM classifiers, without sacrificing much representation 
power. 
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Appendix 


Proposition 1. Given a sparse Gaussian Bayesian Network parameterized by 0 and its associated directed graph 
Q with m nodes, the graph Q is DAG if and only if there exist some o, (i = 1, • • • , to) and T £ R mxm , such that for 
arbitary A > 0, the following constraints are satisfied: 


Oj-Oi> -Y ij, \/i,j £ {1, — ,to}, iyfj (la) 

TO 

Y ^ > 0, (lb) 

x & i:j = 0, (lc) 

A > 0i > 0. (Id) 


Proof. As is known, a Bayesian network is equivalent to a topological ordering (Chapter 8, Section 8.1 on Page 
362 in Il37| ). Therefore, we prove Proposition 1 by showing that i) Eqn. (la ~ Id) lead to a topological ordering 
(the necessary condition), and ii) a topological ordering from a DAG can meet the requirements in Eqn. (la ~ Id) 
(sufficient condition). 

First, we prove the necessary condition by contradiction (Fig. |6|. We consider three cases for two nodes j and i. 
Case 1) the nodes j and i are directly connected. If there is an edge from node i to node j, the parameter 0 , :i is 
then non-zero, and thus Ty must be zero. According to Eqn. (la), we have o :/ > o,;. If at the same time, there is an 
edge from node j to node 'i, similarly we have o, > o :j , which contradicts with Oj > Oi, and therefore is impossible. 
Case 2: the nodes j and i are not directly linked but connected by a path. Suppose there is a directed path PI 
from node i to node j, where PI is composed of nodes k\ - ■ ■ , k mi in order. Following the above proof, we can 
have Oj > Ok m > • • • > Ok 1 > o,. If at the same time another directed path P2 links node j to node i, where P2 is 
composed of nodes h, I 2 , ■ ■ ■ , l m2 i n order, similarly we have o, > o; m2 > • ■ ■ > o/, > o t , making the contradiction. 
Case 3) If there is no edge between node i and node j, by definition 0 y = 0. It is straightforward to see Eqn. (lb) 
and Eqn. (lc) hold for any arbitrary non-negative Y fJ . Moreover, for any o, and o ; satisfying Eqn. (Id), we can 
show that as long as Ty > (— + 1)A (which is positive), Eqn. (la) will always hold. This is further explained as 
follows. By Eqn. (Id), we have —A < o :i — o, < A. For Eqn. (la) to be always held, we need some Y VJ such that 
Oj - Oi > -A > ^ - Y ijr which requires Y,j > (— + 1)A. Therefore, there exist a set of o L and Y valid for Eqn. (la 
~ Id) when no edge links node i and node j. In sum, Eqn. (la ~ Id) show a topological ordering, that is, if node j 
comes after node i (that is, Oj > o,) in the ordering, there can not be a link from node j to node i, which guarantees 
the acyclicity. 


°i > °i , 

Oi > Oj 



■ • > o h > Oj 


Casel Case 2 

Fig. 6. Explanation of our ordering based DAG constraint. 

Now let us consider the sufficient condition, if Q is a DAG, we can obtain some topological ordering (1,2, ■ • • , to) 
from it. Let bi be the index of node i in this ordering. Setting Oj = (Oi — 1)^ (Vi £ {1, ■ ■ • , m}), we have min(oi) = 

(1 — 1) — = 0 and max(oj) = (to — 1)— < A. If node j comes after node i, we have o,- — o , > A > A 

— Y. .. If node 

\ ) rn \ 6 / \ j 777, — -j * J L — m — m 

j comes before node i, we can always set Y,, ; sufficiently large to satisfy Eqn. (la ~ Id). Therefore, from a DAG, 

we can always construct a set of ordering variables that satisfy Eqn. (la ~ Id). 

Combining the proofs above, Eqn. (la ~ Id) are the sufficient and necessary condition for a directed graph Q to 
be DAG. □ 


Proposition 2. The optimization problem in Eqn. (2) (i.e., Eqn. (4.2) in the paper) is iteratively solved by alternate 
optimizations of (i) o and Y with 0 fixed, and (ii) © with o and Y fixed. This optimization converges and the 
output 0* is DAG When X dag > ^(m-2)(n-l)N-^ 1 (2n-2-A 1 ) 


where to is the number of nodes and n is the number 
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of samples. 


m 

min Y' ||x : j - PAjdiWl + Ai||0i||i + Ad ag € i T |0 i | 

©,o,T z ' 

i=l 

s.t. Oj -Oi > — e {1, - - - ,m}, 

m 

0 < Oi < A, Ty > 0 


(2) 


Here o and T are the variables defined in the DAG constraint in Section 4.2, and 0 is the model parameters of 
SGBN. The vector e, denotes the z’-th column of the matrix Y, and 1 6, | the component-wise absolute value of the 
z-th column of 0. Other parameters are defined in Table 1 in the paper. 


Proof. In the following, we prove that: 

1) The alternate optimization in Eqn. (2) converges. 

2) The solution 0* of Eqn. (2) is DAG when A dag is sufficiently large. 


Let us denote /(©, o, Y) = Yn=\ ll x :,* “ PA^iHi + Ai||0i||i + Yag£i T \Oi\- 

First, we prove Eqn. (2) converges by showing that (i) /(©, o.Y) is lower bounded; and (ii) 
/(© (t+1) ,o( t+1 ). Y (t+1) ) < /(© (t) ,oW, Y (t) ), meaning that the function value will monotonically decrease 

with the iteration number t. 

It is easy to see that /(©, o, Y) is lower bounded by 0, since each term in /(©, o, Y) is non-negative. And the 
second point can be proven as follows. At the t-th iteration, we update 0 by 


©(t+!) = 


arg mm 
e 


Ei 

i=l 


x : , PA, 0i||2 


+ Ai||0j||i + A 


^dagC-i 


(t)T l e, 


(3) 


= argmin/(0, o^\ Y^). 


© 


It holds that /(© (t+1 \ o^, Y ( ^) < /(© ( ^, o^\ Y*^). Also it is noted that 0 (t+1 ^ is an achievable global minimum 
of 0 since /(©,o^, T'' ! ) is a convex function with respect to 0. Similarly, we then update o and Y by 


{o ( * +1) , Y (t+1) } =arg min/(0 (t+1) ,o, Y) 


(4) 


s.t. Oj 0 ^ ^ Yjj,Vi,j € {T,* 
m 




0 < Oi < A, 


Y ij > 0. 


It holds that /(0^ +1 ),o* t+1 ), Y^ t+1 ^) < /(0 (t+1 \ o^, Y^). Also, /(© ( - t+1 \ o, Y) is a linear function with respect 
to o and Y. Consequently we have 

/(0 (t+1) ,o( t+1 \ Y (t+1) ) < /(0 lt+1) ,oW,Y (i) ) < /(0 (t) , oW,Y (t) ). 

Therefore, the optimization problem in Eqn. (2) is guaranteed to converge with the alternate optimization strategy, 
because the objective function is lower-bounded and monotonically decreases with the iteration numbers. 

Second, we prove that when A dag > 2m ( m ~ 2 )( ra ~^ ^ out p U j. @* guaranteed to be DAG. This 

could be proven by contradiction. Suppose that such a A d ag does not lead to a DAG, say, Y r , x ©A ^ 0 for at least 
one pair of nodes i and j, with 0), ^ 0 and Yjj > 0. Without loss of generality, we assume Yj* > (A + l)A (where 
A is an arbitary positive number), so the ordering constraints in Eqn. (2) always hold regardless of the variables o, 

and Oj. This is because Oi and Oj are constrained by 0 < Oi < A and 0 < Oj < A, and Oj — Oi > — A = 

T 


PA 




— (Ai + \dag~Pij) > 0. Here, 


Based on the first-order optimality condition, ©T ^ 0 i.f.f. 2 
PAjAjy) denotes the elements in the matrix PA, with the j-th row removed (i.e., parents of the node i without the 


( X :,i 
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node j), and 0 ( ' u denotes the elements in 9 * without 0 *■. However, it can be shown that. 




( x :,i 


- pA h\h : ) 0 Ai 



&i\j PA *(\j)0 X: >J 

(5) 


m 


KdGj + 

X 0 fci X a- X :j 



k=l,k^i,j 



< (n — 1) + (m — 2)(n — 1) max |0Jh| 

Ai 

The second last inequality holds due to the normalization of features x : ^ (to zero mean and unit std). The last 
inequality holds because max|0^| < ||0*||i < £ (||x :>i - PA70*||| + Ai||0*||i + \da g e* T \0*\) = £/(©*, o*, Y*) < 
j^f( 0,o*, Y*) = jrx. ( x : ., = With the given A das , Eqn. (5) results in 


( x :,i 


- PA J(\j,:) 9 i\j 


— (Ai + \dag~£ij) < 0, 


which contradicts the above first-order optimality condition with 0*, ^ 0. Therefore, when A dag is sufficiently 
large, the output 0* is guaranteed to be DAG. 

Summing up the proofs above, the alternate optimization of Eqn. (2) converges and the output 0* is guaranteed 
to be DAG when A d ag is sufficiently large. □ 



